Abstract



Previous structural equation analyses on moral stage sequence usually evaluated only the fit of the 
theoretical model (e.g., Stages 2->3->4->5) to the data. This strategy is relatively weak because of the 
existence of other equivalent or plausible alternative models which might have identical, or even 
better fit. The present study evaluated an exhaustive list of simplex models as well as a number of 
other plausible alternative models in two domains of moral development over two cultural samples 
(Chinese and British). Results revealed no strong competitors to the theoretical model. Issues related 
to simplex structure and the importance of examining alternative models were also discussed. 



Moral Stage Structure p.3 



Stage Structure of Moral Development: A Comparison of Alternative Models 



The stage sequence and structure in moral development is of great interest to social 
psychologists (e.g., Kohlberg, 1969; Rest, 1986). Using structural equation modelling, previous 
researchers (e.g., Boom & Molenaar, 1989; Sachs, 1992) fit their data to the theoretical model and
found some support for the linear development from lower to higher stages (e.g., from Stage 2 to 3, 4, 
5, and 6). However, it is possible that other stage sequences or alternative models will fit and explain 
the data equally well, or even better. The present study re-examined the issue using an exhaustive list 
of simplex models as well as a number of other closely related non-simplex models. These alternative 
models were evaluated by their fit to the data and properness of the parameter estimates. 

Moral Development 

In the study of moral development, the late Lawrence Kohlberg has been considered the only
contemporary psychologist who embraced philosophy as important and essential in the definition of
what is moral. He posited a hierarchical model of 3 levels and 6 stages (2 stages within each level),
composed of the following stages: heteronomous (Stage 1; e.g., avoidance of punishment), individualism
(e.g., reward), mutual interpersonal expectations, social system (law and order), social contract, and
universal principles (Stage 6) (Rest, 1983).

Results from cross-cultural studies (for reviews, see Rest, 1986; Snarey, 1985) generally 
supported the invariant structure of lower moral stages. However, studies using Kohlberg's Moral 
Judgment Interview (MJI) showed that moral development in other cultures, especially in peasant
villages, was usually slower than that in Kohlberg's U.S. norms (Snarey, 1985). Notably, in
some studies (Gorsuch & Barnes, 1973; White, Bushnell, & Regnemer, 1978) moral reasoning beyond
Stage 3 was totally absent at the age of 16. The invariance of higher stages was not unanimously
supported (Edwards, 1975; Maqsud, 1979).

Davison and his colleagues (Davison, 1977; Davison, Robbins, & Swanson, 1978) proposed the use of a
metric unfolding model to test the hierarchical stage structure. It was posited that a principal
components analysis of the stage scores should produce two factors. The first factor should have the
highest loadings in the intermediate stages, while the second factor should have loadings reproducing
the order of the stages. Davison et al. showed that their subjects displayed a moral development
sequence matching the theoretical one.

Except for the original work by Davison et al., however, little similar research has been published
on the reproduction of stage structure with other samples, especially in a cross-cultural setting. Using
a 4-story short version of the Defining Issues Test (DIT; Rest, 1979), one of the most popular objective
moral judgement tests (Kay, 1982), Ma (1985, 1988) found that the factor structures produced by the
British subjects were unstable across samples and were sometimes inconsistent with the theoretical
stage order. Furthermore, the order of Stages 4, 5, and 6 among Chinese subjects in Hong Kong and
Mainland China was quite non-discriminating (Ma, 1988; Ma & Chan, 1987). The findings also suggested a
cultural difference in that the Chinese tended to perceive Stage 4 reasoning as more similar to Stages 5
and 6 than to Stages 2 and 3 (Ma, 1988; Ma & Chan, 1987). Hau and Lew (1989) similarly found
that the sequences between levels (e.g., between Levels I and II) were quite distinct, whereas
those within a level (e.g., between Stages 5 and 6 in Level III) were rather ambiguous.

Simplex Structure 

Simplex models have been used widely to examine longitudinal data in which the same
variable is measured repeatedly on the same people over several occasions (Jöreskog & Sörbom,
1988; Marsh, 1993). Davison (1977; Davison et al., 1978) argued that the correlations of the stage
scores should also display a simplex-like pattern; that is, in any row or column, the correlations
should fall off as one moves away from the main diagonal. This reflects the fact that the
correlations between adjacent stage scores (e.g., Stages 3 and 4) are larger than those between stages
further apart (e.g., Stages 3 and 5). In fact, under a simplex structure the correlation between any two
non-adjacent stages (e.g., Stages 3 and 5) is zero when the effect of an in-between stage (e.g.,
Stage 4) is partialled out (Jöreskog, 1970; Marsh, 1993).
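The simplex pattern described above can be illustrated numerically. The following minimal Python sketch assumes a hypothetical four-stage chain with standardized scores and made-up path coefficients (0.8, 0.7, 0.6); none of these values come from the data analysed in this study. It builds the implied correlation matrix and checks that the partial correlation between two non-adjacent stages vanishes once the in-between stage is controlled.

```python
import numpy as np

# Hypothetical standardized path coefficients between consecutive stages
# (illustrative values only, not estimates from any data set).
betas = [0.8, 0.7, 0.6]

def simplex_corr(betas):
    """Correlation matrix of a perfect simplex: r_ij is the product of the
    path coefficients lying between stage i and stage j."""
    n = len(betas) + 1
    R = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            R[i, j] = R[j, i] = np.prod(betas[i:j])
    return R

def partial_corr(R, i, j, k):
    """Partial correlation of variables i and j, controlling for variable k."""
    num = R[i, j] - R[i, k] * R[j, k]
    den = np.sqrt((1 - R[i, k] ** 2) * (1 - R[j, k] ** 2))
    return num / den

R = simplex_corr(betas)
# Rows/columns 0..3 stand for Stages 2, 3, 4, 5.
print(np.round(R, 3))             # correlations fall off away from the diagonal
# Stage 3 vs. Stage 5, controlling for Stage 4 (indices 1, 3, 2).
print(partial_corr(R, 1, 3, 2))
```

In this idealized case the partial correlation is zero up to floating-point error; with real stage scores, measurement error would generally keep it away from zero, which is one motivation for the quasi-simplex model.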

Consider the path model involving 4 stages (Stages 2, 3, 4, and 5) (see Figure 1). The
rectangular boxes ($y_i$) represent the observed stage scores (usually means of items for a particular
stage), whereas the ovals ($\eta_i$) are the latent constructs reflecting the true stage scores, the
$\varepsilon$s are the errors of measurement, the $\zeta$s are factor residuals, the $\lambda$s are the
factor loadings between observed stage scores and the corresponding true (latent) stage scores, and the
$\beta$s are the path coefficients between latent stage scores, which reflect the proximities of
consecutive stages.



Insert Figure 1 about here 



The perfect simplex model is distinguished from the quasi-simplex one in that the stage scores
of the former are assumed to be measured without error (Jöreskog, 1970; Jöreskog & Sörbom, 1988).
In that case, all $\varepsilon$s are zero, with the $\lambda$s being fixed to one. In contrast, the
quasi-simplex model to be used in the present study hypothesizes that the observed stage scores contain
a measurement error component:

$y_t = \eta_t + \varepsilon_t$, for t = 1 to T occasions

$\eta_t = \beta_t \eta_{t-1} + \zeta_t$, for t = 2 to T occasions.

Although the quasi-simplex model allows for measurement error and gives a closer picture
of real empirical data, it has identification problems. It can be shown
mathematically that the parameters for the first and last two stage scores cannot be uniquely
identified. For the model in Figure 1, one indeterminacy is associated with $\beta_2$, $\psi_1$,
$\psi_2$, and $\theta_1$, while the other involves $\psi_4$ and $\theta_4$ (Jöreskog, 1970; Jöreskog &
Sörbom, 1988, pp. 182-186) [where $\psi = \mathrm{Var}(\zeta)$ and $\theta = \mathrm{Var}(\varepsilon)$].
One condition must be imposed on each of these two sets of parameters to eliminate the indeterminacies.
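The two indeterminacies can be seen directly from the covariance matrix that the model implies. The sketch below is a minimal illustration, writing beta for the path coefficients, psi for the factor residual variances, and theta for the measurement error variances, with all loadings fixed to one. The parameter values and the rescaling constant `c` are hypothetical and chosen only for demonstration. It shows that different parameter sets reproduce exactly the same implied covariance matrix, so the data alone cannot separate them.

```python
import numpy as np

def implied_cov(beta, psi, theta):
    """Model-implied covariance matrix of the observed scores for a
    quasi-simplex with all loadings fixed to one.

    beta  : path coefficients beta_2..beta_T (length T - 1)
    psi   : variances of zeta_1..zeta_T (psi[0] is the variance of eta_1)
    theta : measurement error variances of y_1..y_T
    """
    T = len(psi)
    B = np.zeros((T, T))
    for t in range(1, T):
        B[t, t - 1] = beta[t - 1]
    A = np.linalg.inv(np.eye(T) - B)          # eta = A @ zeta
    return A @ np.diag(psi) @ A.T + np.diag(theta)

# Hypothetical parameter values (illustration only).
beta = [0.8, 0.7, 0.9]
psi = [1.0, 0.4, 0.5, 0.3]
theta = [0.3, 0.3, 0.2, 0.2]
sigma = implied_cov(beta, psi, theta)

# Last-stage indeterminacy: only the sum psi_4 + theta_4 is determined.
sigma_end = implied_cov(beta, [1.0, 0.4, 0.5, 0.5], [0.3, 0.3, 0.2, 0.0])

# First-stage indeterminacy: rescale psi_1 by c and compensate in
# beta_2, psi_2 and theta_1 so that every implied moment is unchanged.
c = 1.2
var_eta2 = beta[0] ** 2 * psi[0] + psi[1]
psi1_new = c * psi[0]
beta2_new = beta[0] / c
psi2_new = var_eta2 - beta2_new ** 2 * psi1_new
theta1_new = theta[0] + (1 - c) * psi[0]
sigma_start = implied_cov([beta2_new, 0.7, 0.9],
                          [psi1_new, psi2_new, 0.5, 0.3],
                          [theta1_new, 0.3, 0.2, 0.2])

print(np.allclose(sigma, sigma_end), np.allclose(sigma, sigma_start))
```

Imposing one constraint on each set (for example, fixing an error variance to zero or equating two error variances) removes this freedom, which is why the choice of constraint can change the parameter estimates without changing the fit.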

In earlier works (e.g., Jöreskog, 1970, 1977), the above indeterminacy was solved by fixing the
error terms at both ends ($\theta_1$ and $\theta_4$) to zero. This method was adopted by Boom and
Molenaar (1989) and Sachs (1992) in their analyses. However, in view of the fact that moral judgement
measures, like many other psycho-social indicators, have only moderate reliabilities (.27 to .78
for individual stage scores; Davison & Robbins, 1978; Hau & Lew, 1989; Rest, 1979), it would be
unreasonable to set $\theta_1$ and $\theta_4$ to zero, which in effect implies no measurement error in
the stage scores at the two ends. In more recent works, Jöreskog (1981; Jöreskog & Sörbom, 1988)
suggested, "The most natural way of eliminating the indeterminacies is to set $\theta_1 = \theta_2$ and
$\theta_3 = \theta_4$" (1988, p. 186). One purpose of the present study is to reanalyse some of the
results of previous studies using the latter strategy, which allows measurement errors in all stage
scores. Notably, it can be shown empirically that the above two methods of solving the indeterminacy
lead to the same goodness of fit to the data, whereas the parameter estimates can be substantially
different (see Table 1).

In the analyses of longitudinal data using simplex models, Marsh and Hau (1994) have also 
noted that it is necessary to include a priori correlated uniquenesses relating the same indicators 
administered on different occasions. In the models examined in the present study, correlated
uniqueness terms were not included. This is justified because, unlike in a longitudinal study, there
are no common indicators across different stages. It is believed that the effects due to correlated
uniquenesses, if any, would be relatively small.

Alternative and Equivalent Models 

In the evaluation of structural equation models, Bollen (1989, pp. 67-72) and many others (e.g.,
MacCallum, Wegener, Uchino, & Fabrigar, 1993) have noted a frequent confusion between
model-data and model-reality consistency. The former checks whether the model is consistent with the
data, whereas the latter examines whether the model is consistent with the real world. The link
between the two is asymmetric in that if the data are consistent with a model, it does not necessarily
follow that the model reflects reality. The existence of other competing models which have
identical or even better fit to the data cannot be ignored. Rather, they have to be eliminated either
empirically (by showing that they do not fit the data) or theoretically (by showing that they are not
logically possible).

In the following we will first discuss equivalent models, which have the same goodness of fit.
Then we will examine other alternative models, which may have worse or better fit to the data. As
regards equivalent models, a trivial case is that the theoretical model and one with exactly the opposite
sequence (e.g., Stages 5 -> 4 -> 3 -> 2 instead of Stages 2 -> 3 -> 4 -> 5) have identical fit to the data.
As we are fixing the $\theta$s of the first two stages to be equal and those of the last two stages to be
equal, it can be shown empirically that in this particular model, a reversal of the order of the
respective two stages will not affect the goodness of fit (e.g., Stages 2 -> 3 -> 4 -> 5,
Stages 3 -> 2 -> 4 -> 5, and Stages 3 -> 2 -> 5 -> 4 have the same fit to the data; see Table 1).
These alternative models can only be distinguished and eliminated by examining whether they all
converge to proper solutions, which include positive measurement errors ($\theta$s) and factor
residuals ($\psi$s), reasonable standard errors, and standardized regression coefficients ($\beta$s)
smaller than one. Furthermore, in the simplex model, as the $\beta$s reflect the proximities between
consecutive stages, they are hypothesized to take positive values only.

Other possible alternative models include those that are slightly (e.g., involving reversal of two 
adjacent stages) or grossly different (e.g., involving reversal of non-adjacent stages) from the 
theoretical sequence. In the present study with 4 stages (Stages 2, 3, 4, and 5), there are 24 possible 
combinations of stage order in the simplex model, some of which are equivalent to each other in their 
fit to the data. 

A lot of other alternative non-simplex models can also be generated which might have better or 
worse fit to the data. An example is to have Stages 4 and 5 as alternative paths for development. That 
is, all children progress from stages 2 to 3, but some will proceed to stage 4 and end there while others 
go on to stage 5 directly without passing through stage 4 [represented by: stages 2 -> 3 -> (4, 5)] (see 
M2 in Table). This model in effect argues that stages 4 and 5 are alternate paths of moral
maturation. 

Sachs (1992) investigated and compared the hierarchical stage order of the Moral 
Development Test (MDT) (Ma, 1987) over two cultures (British and Chinese). However, he 
inspected only the fit of the data to the theoretical model (stages 2 -> 3 -> 4 -> 5). The main purpose 
of the present study was to re-examine the stage sequence by inspecting an exhaustive list of simplex 
models as well as a number of other plausible non-simplex alternatives. These models were evaluated 
by their fit to the data together with the properness of various parameter estimates in the model. 

Method 

The analyses in the present study were based on data collected by Ma (1988, 1989) and 
reported by Sachs (1992). Subjects from two cultures were examined. The 1005 subjects in the 
Chinese sample consisted of 90 Grade (G.) 9, 188 G.10, 243 G.11, 164 G.12, 302 college students and
18 adults from Hong Kong, whereas the 281 British subjects consisted of 60 G.9, 112 G.10, 61 G.12
students and 48 adults. There were approximately equal numbers of male and female subjects in each
sample.

All subjects completed the Moral Development Test (MDT), which assessed both the affective
(moral orientation) and cognitive (moral judgment) aspects of moral development (Ma, 1988, 1989).
The moral orientation (N scores) scales were used to measure subjects' tendency to gratify
psychological needs and to perform altruistic acts towards others (Sachs, 1992). The moral
judgment (J scores) scales, in contrast, were parallel to Rest's Defining Issues Test (DIT) and
measured subjects' cognitive maturity in moral judgment as defined by Kohlberg (1969).

The four correlation matrices of the stage scores reported by Sachs (1992) for the two domains
(moral orientation and moral judgment) and the two samples (Chinese and British) served as input
matrices for the following analyses. Although correlation rather than covariance matrices
were used, this did not substantially affect our results, for the following reasons.

First, in the following particular analyses, goodness of fit indexes were identical irrespective of
whether correlation or covariance matrices were used. Second, because all indicators were actually
means of sets of items on the same Likert scale (e.g., 7 points), it is reasonable to assume that their
means and variances would not differ too much. Indeed, re-analyses with indicators having slightly
different means and variances showed that the following conclusions were robust to such variation.
Third, setting the measurement errors of the first two and of the last two stage scores to be equal in
the correlation matrices was effectively assuming that the respective scales had similar reliability
(or proportion of error variance to true variance). If covariance matrices had been used instead,
setting the $\theta$s to be equal would have been equivalent to assuming that the absolute error
variances (i.e., in the original metric) of the scales were identical. Arguably, in this particular
analysis (especially when we are not setting further constraints across cultural groups), the former
assumption, based on correlation matrices, was as good as, if not better than, that based on
covariance matrices.

Results 



Model Evaluation

When evaluating the goodness of fit of the models, the TLI (Tucker-Lewis Index) and RNI
(Relative Noncentrality Index) were chosen because, unlike many others, these two indices are
unbiased in that their expected values do not vary systematically with sample size. The former,
however, differs in that it embodies a control for model complexity and a reward for parsimony. Both
incremental indexes have been recommended for routine use in model evaluation (Marsh, Balla, &
Hau, 1994).
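For readers who wish to recompute the two indexes, the following minimal Python sketch implements TLI and RNI as they are commonly defined from the chi-square statistics of the target model and of the null (independence) model. The formulas are the standard ones rather than anything specific to this study, and the null-model chi-square used in the example (100 on 6 degrees of freedom) is a hypothetical placeholder, since the null-model values are not reported here.

```python
def tli(chi2_model, df_model, chi2_null, df_null):
    """Tucker-Lewis Index: compares the chi-square/df ratios of the
    target and null models, so it rewards parsimony."""
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)

def rni(chi2_model, df_model, chi2_null, df_null):
    """Relative Noncentrality Index: compares the noncentrality
    estimates (chi-square minus df) of the target and null models."""
    d_null = chi2_null - df_null
    d_model = chi2_model - df_model
    return (d_null - d_model) / d_null

# Target model: chi-square = 0.46 on 1 df (the value reported for the
# theoretical model in the British moral orientation data).
# Null model: hypothetical chi-square of 100 on 6 df (assumption).
chi2_null, df_null = 100.0, 6
print(round(tli(0.46, 1, chi2_null, df_null), 3))
print(round(rni(0.46, 1, chi2_null, df_null), 3))
```

Both indexes can exceed 1 when the model chi-square is smaller than its degrees of freedom, which is why values such as TLI = 1.03 can appear in the tables.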

Models were also checked to see that they converged to solutions with reasonable parameter
estimates, which included nonnegative measurement errors ($\theta$s), nonnegative factor residuals
($\psi$s), and standardized regression paths ($\beta$s) which were positive but less than one. The
results of the evaluation for the two moral domains (orientation and judgment) and in the two cultural
samples are shown in Table 1. When a certain model was improper, one of the sources of the problem,
together with the respective illegal value, was also tabulated (e.g., $\theta$ = TE = -.87). Sometimes,
the parameter estimates may fall on the boundary of the legitimate domain. These solutions, though
proper, may not be reasonable and interpretable (e.g., a stage score with no measurement error,
$\theta$ = 0). They are labelled as worrisome solutions in the table.
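The screening rules just described can be written down compactly. The sketch below is one possible encoding in Python, not the procedure actually used to build Table 1: the function name, the tolerance `tol`, and all parameter values other than the -.87 error variance quoted above are hypothetical.

```python
def classify_solution(thetas, psis, betas, tol=1e-6):
    """Label a converged solution as 'improper', 'worrisome', or 'proper'.

    thetas : measurement error variances
    psis   : factor residual variances
    betas  : standardized path coefficients between latent stage scores
    tol    : hypothetical tolerance for treating an estimate as lying
             on a boundary
    """
    variances = list(thetas) + list(psis)
    # Negative variances are out of range.
    if any(v < -tol for v in variances):
        return "improper"
    # Paths must be positive and must not exceed one.
    if any(b <= tol or b > 1 + tol for b in betas):
        return "improper"
    # Estimates sitting on a boundary (zero variance, path of one).
    if any(abs(v) <= tol for v in variances) or any(abs(b - 1) <= tol for b in betas):
        return "worrisome"
    return "proper"

psis = [1.0, 0.4, 0.5, 0.3]      # hypothetical values
betas = [0.8, 0.7, 0.9]          # hypothetical values
print(classify_solution([-0.87, 0.3, 0.2, 0.2], psis, betas))
print(classify_solution([0.0, 0.3, 0.2, 0.2], psis, betas))
print(classify_solution([0.3, 0.3, 0.2, 0.2], psis, betas))
```

A fuller check would also inspect standard errors, which the text lists among the criteria for a proper solution but which are omitted here for brevity.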

It should be noted, however, that for some less serious problems (such as a negative 
uniqueness close to the zero boundary), it is possible that the analyses of the covariance matrices with 
stage scores having large differences in means and variances might bring the improper solutions back 
to the permissible range. However, a glance through the degree of improperness in Table 1 suggests
that the majority of these improper solutions are unlikely to become proper irrespective of the 
differences in the means and variances of the stage scores. 

Simplex Models 

Preliminary analyses started with the comparison of simplex models formed by the two
strategies ($\theta_1 = \theta_4 = 0$ versus $\theta_1 = \theta_2$, $\theta_3 = \theta_4$) used to solve
the parameter indeterminacy. As can be seen from Table 1, the two methods (M0 and M1) applied to the
theoretical model (Stages 2->3->4->5) led to the same goodness of fit but slightly different parameter
estimates.

It was noticed that the solutions for the Chinese sample in the two moral domains were
problematic. For the orientation scales, the two specifications used to solve the parameter
indeterminacy both resulted in improper solutions (due to negative measurement errors/uniquenesses
of -.87). On the other hand, fitting the theoretical model to the Chinese moral judgment data gave rise
to worrisome solutions with either zero measurement error or a standardized regression path
coefficient of one.

Twenty-three additional simplex models (M2 to M24 in Table 1) were generated for each
domain and cultural sample. These models can be broadly classified as minor, moderate, or
serious reversals of the theoretical one. The minor reversals involved only one reversal of two
adjacent stages, whereas in the moderate ones, a stage was misplaced two stages from its theoretical
order (e.g., stage 2 came after stage 4). All other types of reversals, including a complete reversal of
the theoretical model (i.e., stages 5->4->3->2), were grouped under the serious category. Notably, all
these models had the same degrees of freedom (df = 1).

A browse through the minor reversal category showed that the fits of these models were not
particularly good. In fact, all 12 of these solutions either failed to converge or were improper.
Similarly, for the moderately and seriously reversed models, only a few of the solutions were proper.
Coincidentally, they were all in the British group.

For the three proper solutions (M11, M19, M24) of the British subjects in the moral orientation
domain, the first two were equivalent in that one was just the complete reversal of the other and thus
they had identical fit to the data. Although their RNI (.98) was comparable to that of the
theoretical model (M0, RNI = 1.00), their chi-square and TLI (3.46 and .86, respectively) were much
worse (for M0, Chi = .46, TLI = 1.03). Thus, there was no compelling reason to replace the
theoretical model with these two alternative ones. M24 was just the complete reversal of the
theoretical one and was expected mathematically to have identical fit to the data. Without other
substantial empirical or theoretical support, there was no strong ground to accept this as a replacement
for the theoretical one.

Three proper solutions (M5, M16, M24) were also found in the moral judgment domain of the
British subjects. M5 and M16 were equivalent in that one was the complete reversal of the other.
Both of them had goodness of fit as good as, if not better than, the original theoretical model (M5 and
M16: Chi = 1.42, TLI = .98, RNI = 1.00 versus M0: Chi = 1.81, TLI = .97, RNI = .99). M5 (stages
2->4->5->3) looked closer to the theoretical model, whereas M16 (stages 3->5->4->2) involved greater
reversals of stages. Both models can serve as potential competitors for the theoretical model. M24
was the complete reversal of the theoretical model and was also proper. As discussed above, there
was no compelling reason to accept this as a replacement for the theoretical model.

A closer examination of the fit indexes in Table 1 shows that the 24 models actually
fall into three groups, each with eight models. Within each group, all models are equivalent and
have identical goodness of fit. As constraints have been set on the error terms at the two ends
($\theta_1 = \theta_2$ and $\theta_3 = \theta_4$), it can be seen from Table 1 that models with the
following stage sequences are equivalent: A->B->C->D; B->A->C->D; A->B->D->C; B->A->D->C;
C->D->A->B; C->D->B->A; D->C->A->B; and D->C->B->A. Irrespective of the sample being used, all these
models would have the same goodness of fit. However, it should be noted that the parameter estimates
of these models can and usually will differ.
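The grouping of the 24 orderings can be reproduced by simple bookkeeping. The following Python sketch takes the equivalence rules as stated in the text (swapping the first two stages, swapping the last two stages, and reversing the whole sequence) and enumerates the resulting classes; it does not fit any model and is not a proof of equivalence. The function name is hypothetical.

```python
from itertools import permutations

def equivalent_orderings(order):
    """All orderings reachable from `order` by swapping the first two
    stages, swapping the last two stages, or reversing the sequence."""
    start = tuple(order)
    seen = {start}
    frontier = [start]
    while frontier:
        a, b, c, d = frontier.pop()
        for nxt in ((b, a, c, d), (a, b, d, c), (d, c, b, a)):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return frozenset(seen)

# Partition all 24 orderings of Stages 2-5 into equivalence classes.
groups = {equivalent_orderings(p) for p in permutations((2, 3, 4, 5))}
for g in sorted(groups, key=min):
    print(sorted(g))
```

Under these rules the theoretical sequence 2->3->4->5 shares a class with its complete reversal, and the sequences 2->4->5->3 and 3->5->4->2 fall together in another class, consistent with the pairs of identical fit noted above.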

Non-Simplex Alternative Models 

It is possible to generate a huge number of non-equivalent non-simplex alternative models as
competitors for the theoretical model. For example, one can hypothesize that stages 3, 4, and 5 are
end-points in themselves and are alternative pathways from stage 2. That is, there are three regression
paths going from stage 2 to stages 3, 4, and 5, and there is no other path linking the stages. This is
represented as 'stage 2->(3,4,5)' (see M1 in Table 2).

Due to the great number of possible alternative models, in the following analyses we
limit ourselves to those that are closer to the theoretical model and have a simple structure with only
three paths linking the four stages (the same number as in the simplex model). Stage 2 is placed either
as the first or among the earliest stages in the path diagram. Admittedly, it is possible to have other
grossly different models which have good fit to the data. However, as the theoretical model seems to
fit the data relatively well, it is reasonable to expect that the best fitting alternative model would
have a structure not much different from the theoretical one. Thus, the use of the present set of
models, which closely resemble the theoretical one, as the starting point for exploration is well
justified.

Using the criteria listed above, ten alternative non-simplex models were constructed (see Table
2). As with the indeterminacy problem in the simplex models, not all of the error terms
($\varepsilon_1$ to $\varepsilon_4$) could be separately estimated. The problem cannot be totally
solved even by setting pairs of the error terms to be equal. Thus, in view of the assumption that the
reliability of the stage scores would be approximately the same, it was decided to set all four error
terms to be equal (i.e., $\theta_1 = \theta_2 = \theta_3 = \theta_4$). This solved the identification
problem in all models except M4.

However, an inspection of the solutions for moral judgment in the two cultural groups showed
that the factor residuals were on the boundary (-.06, -.07) (see M0A in Table 2). This indicated that
either the theoretical model or the assumption of equal measurement errors might not be appropriate.
This could not be verified without the support of other empirical data. Nonetheless, the model was
reanalysed by setting the error variance of the last stage score to be slightly smaller than the others
($\theta_1 = \theta_2 = \theta_3 = 1.05\,\theta_4$) (see M0B in Table 2). This led to proper solutions
for both domains in the two cultural samples with no change in the goodness of fit. Notably,
reanalyses of the other models (M1 to M10) using a slightly smaller $\theta_4$ did not substantially
change the results.

An examination of the solutions in Table 2 shows that a number of them are proper. However,
none of the models appeared to be a strong competitor for the theoretical one. Many of the proper
solutions had very poor fit to the data (moral orientation, Chinese: M2, M9, M10; UK: M7, M8, M10;
all proper models in moral judgment). Models M2 and M6 of the British subjects in the moral
orientation domain had slightly larger chi-squares, but TLI and RNI values comparable to those of the
theoretical model. Disappointingly, the same models, when applied to the other cultural sample or
moral domain, did not give the same kind of good fit. Nevertheless, M2 and M6 can still serve as
potential alternative models in the future exploration of moral stage structure.



Discussion and Conclusion






The present study evaluated the stage structure of moral development using a wide range of
quasi-simplex and non-simplex models. In the context of this study, a strong alternative model should
be one that is applicable in both moral domains and is universal for both cultural groups. Despite the
large number of models inspected, an examination of the fit indexes as well as the parameter
estimates revealed no strong competitor for the theoretical model. Even when we limit ourselves to
good models for a particular moral domain and cultural sample, only one or two of the models could
serve as potential competitors.

Despite the finding of no strong competitor, the conclusion that the theoretical model best
describes moral development has to be taken with great caution. It is quite possible that many of the
nonconvergent and improper solutions of the competing models were due to the reliability, or rather the
lack of reliability, of the stage scores. This may lead to an unstable correlation matrix which does not
fully conform to any simplex structure. That is, irrespective of how we rearrange the stage order, the
values in the correlation matrix do not systematically decrease as one moves away from the main
diagonal. Or, the correlations between any two non-adjacent stages are not close to zero when the
effect due to an in-between stage is partialled out.
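Whether any rearrangement of the stages yields a simplex-like matrix can be checked by brute force. The sketch below is a minimal illustration using a hypothetical, error-free correlation matrix (not one of the matrices analysed here); the function names are assumptions. It tries every ordering and keeps those in which, in every row, correlations never increase when moving away from the main diagonal.

```python
from itertools import permutations
import numpy as np

def is_simplex_like(R):
    """True if, in every row, correlations never increase when moving
    away from the main diagonal in either direction."""
    n = R.shape[0]
    for i in range(n):
        right = R[i, i + 1:]      # moving away to the right
        left = R[i, :i][::-1]     # moving away to the left
        for side in (left, right):
            if np.any(np.diff(side) > 0):
                return False
    return True

def simplex_orderings(R):
    """Every ordering of the variables whose rearranged correlation
    matrix is simplex-like."""
    n = R.shape[0]
    return [p for p in permutations(range(n))
            if is_simplex_like(R[np.ix_(p, p)])]

# Hypothetical correlation matrix generated by a perfect simplex.
R = np.array([[1.000, 0.800, 0.560, 0.336],
              [0.800, 1.000, 0.700, 0.420],
              [0.560, 0.700, 1.000, 0.600],
              [0.336, 0.420, 0.600, 1.000]])
print(simplex_orderings(R))
```

For this idealized matrix only the generating order and its reversal qualify. An empty result for an observed matrix would indicate that no ordering conforms to the simplex pattern, which is the situation described above for unreliable stage scores.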

The indeterminacy problem inherent in quasi-simplex structures worsens the already slim
chance of good fit because, in effect, we have to force pairs of error terms to be equal. This,
however, can be overcome by using multiple indicators for each stage. As advocated by Marsh
(1993), this much stronger model allows each parameter to be estimated independently or, if
necessary, restricted to be equal (or according to other constraints).

The use of multiple indicators may sometimes help to solve the problem of low reliability in
stage scores. For stages consisting of a number of coherent item parcels (each composed of a set of
unidimensional items), the aggregation of all items into one score may result in low reliability.
However, when models with multiple indicators are used, the item parcels do not have to be
aggregated. Each of these parcels can make its own contribution to the latent stage factor. This may
lead to a more appropriate evaluation of different competing stage structures.

As in other structural equation analyses, it is advisable to examine and verify the results of the
present study using other established instruments and in additional samples and moral domains.
Furthermore, due to moral immaturity, it is possible that the higher stages may appear to be
non-differentiating to younger subjects. Thus, separate analyses for different age groups, as carried
out by Boom and Molenaar (1989), will give additional support to the validity of any hypothesized stage
structure.

In contrast to previous structural equation analyses on moral development (e.g., Boom &
Molenaar, 1989; Sachs, 1992), the present study evaluated a great number of plausible alternative
simplex and non-simplex models. This process of simultaneous examination and, perhaps,
elimination of various alternative and plausible models is very important, but frequently neglected and
even misinterpreted. For example, Randhawa, Beamer, and Lundberg (1993) fit their cross-sectional
data to one a priori model and postulated a causal structure among mathematics attitude, achievement,
and efficacy. In a comment on this research, Marsh et al. (1994) pointed out the existence of a
number of equivalent models, including one with a direction of causality just opposite to that
hypothesized by Randhawa et al. Marsh et al. challenged Randhawa et al.'s conclusion because there
was no basis to differentiate among these substantially different models. However, in responding to
such criticism, Randhawa and Beamer (1994) clung to their erroneous belief and insisted, "Empirical
equivalence is something that technicians can have fun with, but it is not the raison d'être for
theoretical repudiation" (p. 465).

As exemplified by the above analyses, equivalent models cannot be distinguished in terms of
their fit to the data. Rather, they can only be differentiated or eliminated by criteria such as
interpretability of parameter estimates and meaningfulness of the model (MacCallum et al., 1993).
Sometimes researchers defend their one-model analysis and claim that their a priori model is more
justifiable than other equivalent or alternative models because it is developed from prior research or
theories. MacCallum et al. (1993) criticized this position, pointing out that "such defense would often
be the product of wishful thinking" because "[the argument] implies that no other equally good
explanation of the data is as plausible as the researcher's a priori model simply because the researcher
did not generate the alternatives a priori" (p. 197).

In the above comparison of various competing models, two remarks can also be made as
regards model evaluation. First, by the standard of goodness of fit indexes alone, many of the models
being evaluated are extremely good. However, a more detailed inspection of the parameter estimates
reveals that most of these models are improper. In line with Bollen and Long's (1993, pp. 6-7)
recommendation, this points out the importance of examining the parameter estimates of various
components in model evaluation.

Second, when the TLI and RNI indexes are compared, it can be seen that their respective
values can differ substantially. For example, in the moral judgement simplex-model analyses for the
Chinese sample (see Table 1), TLI is .93 when RNI is .99. But when TLI drops to .74, RNI is still .96.
A further drop of TLI to .39 results in only a small drop of RNI to .90. It is clear that these two
indexes may not be working on the same metric. Along with Bollen and Long (1993), Gerbing and
Anderson, and Marsh, Balla, and Hau (1994), we recommend that researchers consider a
number of appropriate indexes in model evaluation.
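The divergence between the two indexes follows directly from their standard definitions. As a worked sketch, write $\chi^2_0$ and $df_0$ for the null (independence) model and $\chi^2_m$ and $df_m$ for the target model. The value $df_0 = 6$ used below is an assumption (the independence model for four observed stage scores), whereas $df_m = 1$ is the value reported above for the simplex models.

```latex
\begin{align*}
1 - \mathrm{RNI} &= \frac{\chi^2_m - df_m}{\chi^2_0 - df_0}, \\
1 - \mathrm{TLI} &= \frac{\chi^2_m/df_m - 1}{\chi^2_0/df_0 - 1}
                  = \frac{df_0}{df_m}\cdot\frac{\chi^2_m - df_m}{\chi^2_0 - df_0}
                  = \frac{df_0}{df_m}\,(1 - \mathrm{RNI}).
\end{align*}
```

With $df_0 = 6$ and $df_m = 1$, any shortfall of RNI from 1 is magnified sixfold in TLI. An RNI of .90, for example, corresponds to a TLI of about .40, close to the .39 quoted above; the small discrepancy presumably reflects rounding of the reported indexes.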

All in all, the present study evaluated a great number of competing models of moral
development. This process is important and can help to eliminate some possibilities or suggest
alternative explanations. This is definitely not a number-crunching game, as some researchers may
have misconceived (e.g., Randhawa & Beamer, 1994). Perhaps Bollen's (1989; see also Bollen &
Long, 1993, p. 7) advice should be reiterated: "We need to examine other plausible specifications that
fit; we need to explore various avenues to assess whether a model has a reasonable correspondence to
reality" (p. 72).

Figure 1

Quasi-Simplex Model of Moral Development